\[ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \newcommand{\pmacro}{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\res{e} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \]


Preliminary comments:


library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.7
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)

1 Fitting a linear model (*)

The file sales1.csv consists of quarterly sales volumes (in %, indexed to time 0) of a product.

  1. Plot the data
don <- read.csv("salesData/sales1.csv")
p1 <- don %>% 
  ggplot() +
  aes(x = time, y = y) +
  geom_point() +
  geom_line() +
  xlab("time") +
  ylab("percentage of sales") +
  ggtitle("Sales throughout time") +
  theme_bw()
p1

  2. Fit a polynomial model to this data (justify the choice of the degree). What do the residuals suggest?

After plotting the data we do not see any sign of higher-order polynomial behaviour; the trend looks affine, i.e. well summarized by a straight line. We will therefore fit a degree-1 polynomial to the data:

lm1 <- lm(y ~ time, data = don)
pred1 <- predict(lm1)
don <- data.frame(don,pred1)
summary(lm1)
## 
## Call:
## lm(formula = y ~ time, data = don)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6994 -1.5777 -0.3596  1.6444  3.4747 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  99.99688    0.94971  105.29  < 2e-16 ***
## time          0.37985    0.02643   14.37 1.17e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.2 on 19 degrees of freedom
## Multiple R-squared:  0.9158, Adjusted R-squared:  0.9113 
## F-statistic: 206.5 on 1 and 19 DF,  p-value: 1.166e-11
p2 <- p1 +
  geom_line(data = don, aes(x = time, y = pred1), color = "red") +
  theme_bw()
p2

par(mfrow=c(2,2))
plot(lm1)

The time variable is statistically significant according to its t-value, as is the intercept. The F-statistic p-value shows that adding the variable time improves on an intercept-only model, and the adjusted R^2 is quite high. The diagnostic plots support the Gaussian linear model hypotheses (equal variance and normal distribution of the residuals), and leverage and influence look acceptable from the Cook's distance point of view.

  3. Try to improve the model by adding a periodic component (little reminder: \(\cos(2\pi t/T)\) and \(\sin(2\pi t/T)\) are periodic functions of period \(T\)). Write your final model as a mathematical equation.

Here time is measured in months, so we can try to capture a yearly periodicity by taking a period of T = 12. We first add a \(\cos(2\pi t/12)\) term to the model:

Using cosine:

\[ y_i = \beta_0 + \beta_1 t_i + \beta_2 \cos(2\pi t_i / T) + \varepsilon_i \]

lm2 <- lm(y ~ time + (cos((2*pi*time)/12)), data = don)
pred2 <- predict(lm2)
don <- data.frame(don,pred2)
p3 <- p1 +
  geom_line(data = don, aes(x = time, y = pred2), color = "red") +
  theme_bw()
p3

summary(lm2)
## 
## Call:
## lm(formula = y ~ time + (cos((2 * pi * time)/12)), data = don)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1087 -1.3212 -0.1863  1.5072  3.4641 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             100.03019    0.87687 114.076  < 2e-16 ***
## time                      0.37706    0.02444  15.430 8.03e-12 ***
## cos((2 * pi * time)/12)   1.28801    0.62150   2.072   0.0529 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.031 on 18 degrees of freedom
## Multiple R-squared:  0.932,  Adjusted R-squared:  0.9244 
## F-statistic: 123.3 on 2 and 18 DF,  p-value: 3.116e-11
par(mfrow=c(2,2))
plot(lm2)

The new periodic term is only borderline significant according to its t-value (p ≈ 0.053, just above the conventional 0.05 level); all the other coefficients remain highly significant. The F-statistic p-value shows the model improves on an intercept-only model, and the adjusted R^2 is higher than in the previous model. The diagnostic plots support the Gaussian linear model hypotheses (equal variance and normal distribution of the residuals), and leverage and influence seem to have improved compared to the previous model.

Using sine:

\[ y_i = \beta_0 + \beta_1 t_i + \beta_2 \sin(2\pi t_i / T) + \varepsilon_i \]

lm3 <- lm(y ~ time + (sin((2*pi*time)/12)), data = don)
pred3 <- predict(lm3)
don <- data.frame(don,pred3)
p4 <- p1 +
  geom_line(data = don, aes(x = time, y = pred3), color = "red") +
  theme_bw()
p4

summary(lm3)
## 
## Call:
## lm(formula = y ~ time + (sin((2 * pi * time)/12)), data = don)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6552 -0.6945  0.4861  0.6453  2.1662 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              99.6598     0.5896  169.04  < 2e-16 ***
## time                      0.3889     0.0164   23.71 5.02e-15 ***
## sin((2 * pi * time)/12)   2.4069     0.4267    5.64 2.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.359 on 18 degrees of freedom
## Multiple R-squared:  0.9696, Adjusted R-squared:  0.9662 
## F-statistic: 286.6 on 2 and 18 DF,  p-value: 2.246e-14
par(mfrow=c(2,2))
plot(lm3)

The new sine periodic term is statistically significant according to its t-value, as are the intercept and time. The F-statistic p-value shows this model improves on an intercept-only model, and the adjusted R^2 is higher than in all our previous models. The diagnostic plots support the Gaussian linear model hypotheses (equal variance and normal distribution of the residuals), although the residuals do not look quite as evenly distributed as before; they nevertheless remain acceptable. Leverage and influence seem to have improved compared to the degree-1 polynomial model.

Since both terms bring improvements, we can try including them together to see whether the fit improves further:

1.1 Cosine & sine

\[ y_i = \beta_0 + \beta_1 t_i + \beta_2 \sin(2\pi t_i / T) + \beta_3 \cos(2\pi t_i / T) + \varepsilon_i \]

lm4 <- lm(y ~ time + (sin((2*pi*time)/12)) + (cos((2*pi*time)/12)), data = don)
pred4 <- predict(lm4)
don <- data.frame(don,pred4)
p5 <- p1 +
  geom_line(data = don, aes(x = time, y = pred4), color = "red") +
  theme_bw()
p5

summary(lm4)
## 
## Call:
## lm(formula = y ~ time + (sin((2 * pi * time)/12)) + (cos((2 * 
##     pi * time)/12)), data = don)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7557 -0.4608  0.1007  0.8127  1.2649 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             99.69815    0.44996 221.571  < 2e-16 ***
## time                     0.38611    0.01254  30.797 2.36e-16 ***
## sin((2 * pi * time)/12)  2.35218    0.32593   7.217 1.44e-06 ***
## cos((2 * pi * time)/12)  1.18481    0.31757   3.731  0.00166 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.037 on 17 degrees of freedom
## Multiple R-squared:  0.9833, Adjusted R-squared:  0.9803 
## F-statistic: 332.9 on 3 and 17 DF,  p-value: 2.716e-15
par(mfrow=c(2,2))
plot(lm4)

All terms are statistically significant and the adjusted R^2 has increased further. The diagnostic plots also show an improvement in the equal-variance and independence assumptions on the residuals compared to our previous model.

To decide which model is best so far, we can compare all four with the AIC and BIC criteria:
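As a reminder, both criteria trade off the maximized likelihood \(\hat{\like}\) against model complexity (lower values are better):

\[ \mathrm{AIC} = 2k - 2\log \hat{\like}, \qquad \mathrm{BIC} = k \log(n) - 2\log \hat{\like}, \]

where \(k\) is the number of estimated parameters and \(n\) the number of observations; BIC penalizes extra parameters more heavily as soon as \(n \geq 8\).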

BIC(lm1,lm2,lm3,lm4)
##     df      BIC
## lm1  3 99.74920
## lm2  4 98.30006
## lm3  4 81.41808
## lm4  5 71.90108
AIC(lm1,lm2,lm3,lm4)
##     df      AIC
## lm1  3 96.61563
## lm2  4 94.12197
## lm3  4 77.23999
## lm4  5 66.67847

According to both criteria, the most adequate model is the one combining the sine and cosine periodic terms (lm4).

  4. Plot on the same graph the observed sales together with the predicted sales given by your final model. What do you think about this model? What about the residuals?
p5

par(mfrow=c(2,2))
plot(lm4)

As mentioned before, the terms of our model are individually statistically significant according to their t-tests, the adjusted R^2 is the highest among our models, and both the AIC and BIC criteria support this choice. Concerning the residuals, despite a slight departure from the equal-variance and independence assumptions of the Gaussian linear model, the model remains valid. So the model seems to fit our data correctly, although I am wary of overfitting.

  5. We want the predicted sales volume to be equal to 100 at time 0. Modify your final model in order to take this constraint into account.

We remove the free intercept and constrain the prediction at time 0 to be 100, i.e. \(y_i = f(t_i) + \varepsilon_i\) with

\[ f(t) = 100 + \beta_1 t + \beta_2 \sin(2\pi t/12) + \beta_3 \left( \cos(2\pi t/12) - 1 \right). \]

Since \(\sin(0) = 0\) and \(\cos(0) - 1 = 0\), this guarantees \(f(0) = 100\).

lm5 <- lm(y -100 ~ -1 + time + I(sin((2*pi*time)/12)) + I(cos((2*pi*time)/12)-1), data = don)
pred5 <- predict(lm5, don) +100
don <- data.frame(don,pred5)
p6 <- don %>% 
  ggplot() +
  aes(x = time, y = y) +
  geom_point() +
  geom_line(aes(x = time, y = pred5), color = "red") +
  theme_bw()
p6

summary(lm5)
## 
## Call:
## lm(formula = y - 100 ~ -1 + time + I(sin((2 * pi * time)/12)) + 
##     I(cos((2 * pi * time)/12) - 1), data = don)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9858 -0.4548  0.2311  0.8582  1.7732 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## time                           0.400663   0.008894  45.049  < 2e-16 ***
## I(sin((2 * pi * time)/12))     2.408199   0.337386   7.138 1.19e-06 ***
## I(cos((2 * pi * time)/12) - 1) 0.888120   0.267237   3.323  0.00378 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.08 on 18 degrees of freedom
## Multiple R-squared:  0.9948, Adjusted R-squared:  0.9939 
## F-statistic:  1138 on 3 and 18 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm5)

To check that the intercept is indeed set to 100:

NEW_PT <- data.frame(time = 0)
predict(lm5,NEW_PT) + 100
##   1 
## 100


2 Fitting a linear mixed effects model (**)

The file sales30.csv now consists of quarterly sales volumes (still in %, indexed to time 0) of 30 different products.

  1. Plot the data
don2 <- read.csv("salesData/sales30.csv")
p1 <- don2 %>% 
  ggplot() +
  aes(x = time, y = y, color = as.factor(id)) +
  geom_point() +
  xlab("time") +
  ylab("sales") +
  ggtitle("quarterly sales for 30 different products")
p1

  2. Fit the model used previously for fitting the first series to this data and comment on the results.
lm6 <- lm(y ~ time + (sin((2*pi*time)/12)) + (cos((2*pi*time)/12)),
          data = don2)
pred6 <- predict(lm6)
don2 <- data.frame(don2,pred6)
p2 <- p1 + 
  geom_line(data = don2, aes(x = time, y = pred6)) +
  facet_wrap(~id)
p2

summary(lm6)
## 
## Call:
## lm(formula = y ~ time + (sin((2 * pi * time)/12)) + (cos((2 * 
##     pi * time)/12)), data = don2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7030  -2.9607   0.1117   3.3645  17.4116 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             100.09336    0.42471 235.676  < 2e-16 ***
## time                      0.21655    0.01183  18.299  < 2e-16 ***
## sin((2 * pi * time)/12)   1.24348    0.30764   4.042 5.96e-05 ***
## cos((2 * pi * time)/12)   0.89834    0.29975   2.997  0.00283 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.36 on 626 degrees of freedom
## Multiple R-squared:  0.3634, Adjusted R-squared:  0.3603 
## F-statistic: 119.1 on 3 and 626 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lm6)

Here we do not take into account any product-specific randomness (for instance in the intercept), which explains why the fitted curve sometimes lies well above or well below the actual data points of a given product.

  3. Fit a mixed effects model to this data (discuss the choice of fixed and random effects). Write your final model as a mathematical equation.
library(lme4)
don2 <- don2 %>% 
  mutate(cos_per = cos((2*pi*time)/12)) %>% 
  mutate(sin_per = sin((2*pi*time)/12))

There are various ways to introduce randomness into the model; we will proceed sequentially:
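All the variants below are special cases of the same linear mixed effects model, in which some coefficients receive a product-specific random component:

\[ y_{ij} = (\beta_0 + b_{0,i}) + (\beta_1 + b_{1,i})\, t_{ij} + (\beta_2 + b_{2,i}) \sin(2\pi t_{ij}/12) + (\beta_3 + b_{3,i}) \cos(2\pi t_{ij}/12) + \varepsilon_{ij}, \]

with \(b_i \sim \mathcal{N}(0, \Omega)\) and \(\varepsilon_{ij} \iid \mathcal{N}(0, \sigma^2)\); each variant fixes some of the \(b_{k,i}\) to zero.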

2.1 Intercept random effects

This was our first intuition when looking at the plot above

lmr1 <- lmer(y ~ (1|id) + time + sin_per + cos_per, data = don2)

Looking closer, there also seems to be randomness in the slope!

2.2 Intercept & Slope random effects

lmr2 <- lmer(y ~ (time|id) + sin_per + cos_per, data = don2)

2.3 Sinusoidal random effects

2.3.1 with intercept randomness

lmr3 <- lmer(y ~ (time|id) +
               sin_per + 
               cos_per + 
               (cos_per + time|id),
             data = don2)

2.3.2 without

lmr4 <- lmer(y ~ time + sin_per + (time + sin_per | id) + cos_per, data = don2)

2.4 Cosine random effects

2.4.1 with intercept randomness

lmr5 <- lmer(y ~ (time|id) +
               cos_per + 
               (time + cos_per | id) + 
               sin_per,
             data = don2)

2.4.2 without

lmr6 <- lmer(y ~ time +
               cos_per + 
               (time + cos_per | id) +
               sin_per,
             data = don2)

2.5 Periodic element random effects

2.5.1 with intercept randomness

lmr7 <- lmer(y ~ (time|id) +
               cos_per +
               (time + cos_per | id) +
               (time + sin_per | id),
             data = don2)

2.5.2 without

lmr8 <- lmer(y ~ time +
               cos_per +
               (time + cos_per | id) +
               (time + sin_per | id),
             data = don2)

Now, to compare all these models, let's use the AIC and BIC criteria:

AIC(lmr1,lmr2,lmr3,lmr4,lmr5,lmr6,lmr7,lmr8)
##      df      AIC
## lmr1  6 3457.001
## lmr2  7 3021.690
## lmr3 13 3032.737
## lmr4 11 2156.560
## lmr5 13 3032.970
## lmr6 11 2997.041
## lmr7 18 2210.082
## lmr8 16 2168.809
BIC(lmr1,lmr2,lmr3,lmr4,lmr5,lmr6,lmr7,lmr8)
##      df      BIC
## lmr1  6 3483.675
## lmr2  7 3052.810
## lmr3 13 3090.531
## lmr4 11 2205.463
## lmr5 13 3090.764
## lmr6 11 3045.944
## lmr7 18 2290.105
## lmr8 16 2239.940

Many other combinations are possible, but we can already see that the model with random effects on the slope and on the sine periodic term (lmr4) clearly stands out. We can check whether simplifying its random-effects structure, by removing the random intercept, improves it further:

lmr9 <- lmer(y ~ time + sin_per + ( -1 + time + sin_per | id) + cos_per, data = don2)
BIC(lmr9,lmr4)
##      df      BIC
## lmr9  8 2186.126
## lmr4 11 2205.463
AIC(lmr9,lmr4)
##      df     AIC
## lmr9  8 2150.56
## lmr4 11 2156.56

We will therefore consider the following model:

\[ y_{ij} = \beta_0 + (\beta_1 + b_{1,i})\, t_{ij} + (\beta_2 + b_{2,i}) \sin(2\pi t_{ij}/12) + \beta_3 \cos(2\pi t_{ij}/12) + \varepsilon_{ij}, \]

with \((b_{1,i}, b_{2,i}) \sim \mathcal{N}(0, \Omega)\) the random effects of product \(i\) and \(\varepsilon_{ij} \iid \mathcal{N}(0, \sigma^2)\).

  4. Plot the data with the predicted sales given by your final model.
pred9 <- predict(lmr9)
don2 <- data.frame(don2,pred9)
p9 <- p1 + 
  geom_line(data = don2, aes(x = time, y = pred9)) +
  facet_wrap(~id)
p9

Indeed, the individual fits are much better than our first attempt.

  5. How could you take into account the previous constraint (predicted sales volumes all equal to 100 at time 0)?

We use the same model as before taking into account this constraint and the different ids:

\[ y_{ij} = 100 + (\beta_1 + b_{1,i})\, t_{ij} + (\beta_2 + b_{2,i}) \sin(2\pi t_{ij}/12) + \beta_3 \left( \cos(2\pi t_{ij}/12) - 1 \right) + \varepsilon_{ij} \]

  • The model:
lmr10 <- lmer(y -100 ~ -1 +
                time +
                I(sin((2*pi*time)/12)) +
                (-1 + time + I(sin((2*pi*time)/12)) || id) +
                I(cos((2*pi*time)/12)-1),
              data = don2)
  • Plotting our results:
pred10 <- predict(lmr10,don2) + 100
don2 <- data.frame(don2,pred10)
p10 <- p1 +
  geom_line(data = don2, aes(x = time, y=pred10, color = as.factor(id))) +
  theme_bw()
p10


3 Individual prediction (***)

The file salesNew.csv consists of quarterly sales volumes of another product.

The final model of part 2 will be used here. In other words, you should not use the new data to fit any new model.

  1. Suppose first that we don't have any data for this product (although data are available for this product, we act as if we did not know them). How can we predict the sales volumes for this product? Plot the data and the prediction on the same graph.
don3 <- read.csv("salesData/salesNew.csv")
lmr <- lmer(y ~ time + cos_per + sin_per + (-1 + time + sin_per || id), don2)

If we don't have any observation for this product, the best we can do is the population prediction: since the random effects have zero mean, we predict using only the fixed effects \(\hat\beta\) of the model fitted in part 2.
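Formally, the population prediction is \(\hat{y}(t) = x(t)^\top \hat\beta\), where \(x(t)\) is the row of fixed-effect regressors built in the code below:

\[ \hat{y}(t) = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 \left( \cos(2\pi t/12) - 1 \right) + \hat\beta_3 \sin(2\pi t/12). \]

Note that the code uses the centred cosine \(\cos(2\pi t/12) - 1\), so that \(\hat{y}(0) = \hat\beta_0 \approx 100\).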

t <- seq(0,66,0.2)
X <- cbind(1,t,cos(2*pi*t/12)-1, sin(2*pi*t/12))
beta <- fixef(lmr)
pred <- data.frame(time=t, pred0=X %*% beta)
p13 <- ggplot(don3, aes(x = time, y = y)) +
  geom_point() + 
  xlab("time") + 
  ylab("sales") +
  geom_line(data = pred, aes(time, pred0), color = "red")
p13

Unsurprisingly, the population curve is off compared to the data points: it ignores the individual behaviour of this particular product.

  2. Suppose now that only the first data point, at time 1, is available for this product. Compute and plot the new predictions.

  3. Repeat the same process with an increasing number of observed data points. Comment on the results.

I will answer these two questions together.

The model fitted in part 2 provides a priori information on the random effects. Thanks to it, we can use Bayesian estimators, based on the conditional distribution of the random effects given the observed data, to obtain a more accurate individual prediction. In our case we use the MAP estimator seen in class.
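Since everything is Gaussian, the MAP estimate of the new product's random effects \(b = (b_{\rm time}, b_{\rm sin})\), given its first \(j\) observations \(y\), has a closed form (this is what the loop below computes):

\[ \hat{b} = \left( \frac{A^\top A}{\sigma^2} + \Omega^{-1} \right)^{-1} \frac{A^\top \left( y - X\hat\beta \right)}{\sigma^2}, \]

where \(A\) stacks the random-effect regressors \((t, \sin(2\pi t/12))\), \(X\) the fixed-effect regressors, \(\Omega\) is the estimated random-effect covariance matrix and \(\sigma^2\) the residual variance. The individual prediction is then \(\hat{y}(t) = x(t)^\top \hat\beta + a(t)^\top \hat{b}\).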

summary(lmr)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ time + cos_per + sin_per + ((0 + time | id) + (0 + sin_per |  
##     id))
##    Data: don2
## 
## REML criterion at convergence: 2137.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.04054 -0.60956  0.02221  0.60432  3.06362 
## 
## Random effects:
##  Groups   Name    Variance Std.Dev.
##  id       time    0.01878  0.137   
##  id.1     sin_per 8.42646  2.903   
##  Residual         1.03600  1.018   
## Number of obs: 630, groups:  id, 30
## 
## Fixed effects:
##              Estimate Std. Error  t value
## (Intercept) 100.09336    0.08064 1241.178
## time          0.21655    0.02512    8.621
## cos_per       0.89834    0.05692   15.784
## sin_per       1.24348    0.53319    2.332
## 
## Correlation of Fixed Effects:
##         (Intr) time   cos_pr
## time    -0.077              
## cos_per  0.023 -0.005       
## sin_per -0.011  0.001 -0.005
Sig <- diag(c(0.01878, 8.42647))  # estimated random-effect variances (time, sin_per)
sig <- solve(Sig)                 # its inverse, Sigma^(-1)
res <- 1.036                      # residual variance sigma^2 (= 1.018^2)
t <- seq(1, 61, 3)                # observation times of the new product (every 3 months)
pts <- seq(1, 20, 3)              # numbers of observations to use: 1, 4, 7, ..., 19
for (j in pts){
  betaj <- beta
  yj <- don3$y[1:j]   # first j observations of the new product
  tj <- t[1:j]
  Xj <- cbind(1, tj, cos(2*pi*tj/12)-1, sin(2*pi*tj/12))  # fixed-effect design
  Aj <- cbind(tj, sin(2*pi*tj/12))                        # random-effect design
  # MAP estimate of the random effects given the first j observations
  coefj <- solve(t(Aj) %*% Aj/res + sig)
  coeffj <- coefj %*% t(Aj) %*% (yj - Xj %*% beta)/res
  # shift the population coefficients by the estimated individual deviations
  betaj["time"] <- beta["time"] + coeffj[1]
  betaj["sin_per"] <- beta["sin_per"] + coeffj[2]
  pred$predj <- X %*% betaj
  print(p13 +
    geom_line(data = pred, aes(x = time, y = predj), color = "green"))
}

As expected, the fit improves as we increase the number of data points used in the estimation.